The Tidyverse and Base R

MKTG 585R Quantitative Marketing Pre-PhD Seminar

Summarize Data

Summarize Discrete Data

The go-to statistic for a discrete variable is a count.

customer_data |> 
  count(region)
# A tibble: 4 × 2
  region        n
  <chr>     <int>
1 Midwest    1101
2 Northeast  3224
3 South      1111
4 West       5095

Summarize Continuous Data

The obvious statistic for a continuous variable is a mean.

customer_data |>
  summarize(avg_income = mean(income))
# A tibble: 1 × 1
  avg_income
       <dbl>
1    138623.

Note that summarize() is a more general version of count().

We can also compute the mode, median, variance, standard deviation, minimum, maximum, sum, etc.
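
As a minimal sketch of these other statistics, the block below computes several of them in one summarize() call. The toy_customers tibble and its values are made up for illustration and are not the course data. Base R has no built-in function for the statistical mode, so a sorted count stands in for it here.

```r
library(dplyr)

# Hypothetical toy data standing in for customer_data.
toy_customers <- tibble(
  region = c("West", "West", "South", "Midwest", "West"),
  income = c(50000, 70000, 90000, 110000, 130000)
)

# Several continuous summaries in a single call.
income_summary <- toy_customers |>
  summarize(
    median_income = median(income),
    var_income = var(income),
    sd_income = sd(income),
    min_income = min(income),
    max_income = max(income),
    total_income = sum(income)
  )

# The mode of a discrete variable: the most frequent category.
region_mode <- toy_customers |>
  count(region, sort = TRUE) |>
  slice_head(n = 1)

income_summary
region_mode
```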

Visualize Data

{ggplot2} provides a consistent grammar of graphics built with layers.

  1. Data – Data to visualize.
  2. Aesthetics – Mapping graphical elements to data.
  3. Geometry – Or “geom,” the graphic representing the data.
  4. Facets, Labels, Scales, etc.

Visualize Discrete Data

Plot our first summary (note how + is different from |>).

customer_data |> 
  count(region, college_degree) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill")

Facets

Facets allow us to visualize by another discrete variable. For example, is this relationship different depending on gender?

customer_data |> 
  count(region, college_degree, gender) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  facet_wrap(~ gender)

Labels and Scales

It’s no longer a count on the y-axis. Let’s change the labels.

customer_data |> 
  count(region, college_degree, gender) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  facet_wrap(~ gender) +
  labs(
    title = "Proportion of Customers with College Degrees by Region and Gender",
    subtitle = "Based on 10,531 Customers in the CRM Database",
    x = "Region",
    y = "Proportion"
  )

What about the legend? And these colors?

customer_data |> 
  count(region, college_degree, gender) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  facet_wrap(~ gender) +
  labs(
    title = "Proportion of Customers with College Degrees by Region and Gender",
    subtitle = "Based on 10,531 Customers in the CRM Database",
    x = "Region",
    y = "Proportion"
  ) +
  scale_fill_manual(
    name = "College Degree",
    values = c("royalblue", "darkblue")
  )

Visualize Continuous Data

Let’s plot the distribution of income.

customer_data |> 
  ggplot(aes(x = income)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
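
The message above is a nudge to choose the bins deliberately. A minimal sketch, assuming simulated incomes in place of customer_data and an arbitrary bin width of 10,000 (pick whatever width suits the real data):

```r
library(ggplot2)

# Simulated incomes standing in for customer_data$income (hypothetical values).
set.seed(42)
toy_customers <- data.frame(
  income = round(rnorm(500, mean = 140000, sd = 40000))
)

# Setting binwidth explicitly removes the message; each bar spans 10,000.
income_hist <- ggplot(toy_customers, aes(x = income)) +
  geom_histogram(binwidth = 10000)

income_hist
```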

Visualize the relationship between income and credit.

customer_data |> 
  ggplot(aes(x = income, y = credit)) +
  geom_point()

Visualize the relationship between star_rating and income.

customer_data |> 
  ggplot(aes(x = star_rating, y = income)) +
  geom_point()
Warning: Removed 7372 rows containing missing values (`geom_point()`).

What do we do if there is overplotting? There’s a geom for that.

customer_data |> 
  drop_na(star_rating) |> 
  ggplot(aes(x = star_rating, y = income)) +
  geom_jitter(size = 3, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ region) +
  labs(
    title = "Relationship Between Star Rating and Income by Region",
    x = "Star Rating",
    y = "Income"
  )
`geom_smooth()` using formula = 'y ~ x'

Summarize Continuous and Discrete Data

Grouped summaries provide a powerful solution for computing continuous statistics by discrete categories.

customer_data |>
  group_by(gender) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    avg_credit = mean(credit)
  )
# A tibble: 3 × 4
  gender     n avg_income avg_credit
  <chr>  <int>      <dbl>      <dbl>
1 Female  5219    130685.       668.
2 Male    4214    146861.       666.
3 Other   1098    144735.       665.

Note how group_by() is a lot like facet_wrap(): it splits the data by each category in the discrete grouping variable, and the summary is then computed within each group.

count() is a wrapper around a grouped summary using n().

customer_data |>
  group_by(gender) |>
  summarize(
    n = n()
  )
# A tibble: 3 × 2
  gender     n
  <chr>  <int>
1 Female  5219
2 Male    4214
3 Other   1098
customer_data |>
  count(gender)
# A tibble: 3 × 2
  gender     n
  <chr>  <int>
1 Female  5219
2 Male    4214
3 Other   1098

We can group by more than one discrete variable.

customer_data |>
  group_by(gender, region) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    avg_credit = mean(credit)
  ) |> 
  arrange(desc(avg_income))
`summarise()` has grouped output by 'gender'. You can override using the
`.groups` argument.
# A tibble: 12 × 5
# Groups:   gender [3]
   gender region        n avg_income avg_credit
   <chr>  <chr>     <int>      <dbl>      <dbl>
 1 Other  Midwest     124    154637.       663.
 2 Male   Midwest     420    152467.       666.
 3 Other  Northeast   337    150564.       665.
 4 Male   Northeast  1285    150498.       665.
 5 Male   West       2079    149453.       667.
 6 Other  West        519    144420.       667.
 7 Female Midwest     557    134083.       671.
 8 Female West       2497    133819.       668.
 9 Female Northeast  1602    133333.       669.
10 Other  South       118    119068.       660.
11 Male   South       430    117988.       669.
12 Female South       563    105888.       664.
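
The message printed above says the result is still grouped by gender. A minimal sketch of the .groups argument it mentions, using a made-up toy_customers tibble rather than the course data:

```r
library(dplyr)

# Hypothetical toy data standing in for customer_data.
toy_customers <- tibble(
  gender = c("Female", "Female", "Male", "Male", "Male"),
  region = c("West", "South", "West", "West", "South"),
  income = c(100000, 80000, 120000, 140000, 90000)
)

# .groups = "drop" returns an ungrouped tibble and suppresses the message.
by_gender_region <- toy_customers |>
  group_by(gender, region) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    .groups = "drop"
  ) |>
  arrange(desc(avg_income))

by_gender_region
```

Dropping the groups is usually the safer default, since any later mutate() or summarize() on a grouped result would silently operate within each gender.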

We can also use slice_*() functions along with group_by().

customer_data |>
  group_by(gender, region) |>
  slice_max(income, n = 3)
# A tibble: 36 × 13
# Groups:   gender, region [12]
   customer_id birth_year gender income credit married college_degree region   
         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
 1        6119       1984 Female 315000   698. No      Yes            Midwest  
 2        1299       1992 Female 306000   610. No      Yes            Midwest  
 3        7139       1957 Female 302000   727. No      Yes            Midwest  
 4        1040       1993 Female 356000   672. Yes     Yes            Northeast
 5        6503       1997 Female 348000   599. No      Yes            Northeast
 6       11249       1992 Female 343000   620. No      Yes            Northeast
 7        2075       1989 Female 374000   578. No      Yes            South    
 8        7128       1966 Female 301000   708. Yes     Yes            South    
 9        4756       1977 Female 293000   790. No      Yes            South    
10        7366       1993 Female 376000   682. No      Yes            West     
# ℹ 26 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>
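
Other members of the slice_*() family work the same way within groups. A small sketch on made-up data (toy_customers is hypothetical):

```r
library(dplyr)

# Hypothetical toy data standing in for customer_data.
toy_customers <- tibble(
  region = c("West", "West", "West", "South", "South"),
  income = c(50000, 70000, 90000, 40000, 60000)
)

# Lowest earner within each region.
lowest <- toy_customers |>
  group_by(region) |>
  slice_min(income, n = 1)

# First row within each region, in the order the rows appear.
first_rows <- toy_customers |>
  group_by(region) |>
  slice_head(n = 1)

lowest
first_rows
```

Note that slice_max() and slice_min() keep ties by default (with_ties = TRUE), so a group can return more than n rows.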

Visualize Continuous and Discrete Data

Just like there are geoms for visualizing continuous or discrete data, there are geoms for visualizing the relationship between continuous and discrete data.

customer_data |> 
  ggplot(aes(x = income, y = gender)) +
  geom_boxplot()
customer_data |> 
  ggplot(aes(x = income, fill = gender)) +
  geom_density(alpha = 0.5)

Tidy Data

Tidy Data and the Tidyverse

Tidy data is defined as follows:

  • Each observation has its own row.
  • Each variable has its own column.
  • Each value has its own cell.

This may seem obvious or simple, but this common philosophy is at the heart of the tidyverse. It also means we will often prefer longer datasets to wider datasets and {tidyr} will help us move between the two.

Pivot Longer

The most common problem with messy data is when column names are really values.

crm_data |> 
  select(region, customer_id, contains("2018"))
# A tibble: 10,531 × 14
   region    customer_id jan_2018 feb_2018 mar_2018 apr_2018 may_2018 jun_2018
   <chr>           <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 South            1001        1        0        0        0        0        4
 2 West             1002        0        0        0        0        0        0
 3 South            1003        4        0        0        0        0        0
 4 Midwest          1004        0        0        0        2        0        0
 5 West             1005        0        1        0        0        0        0
 6 Midwest          1006        0        0        0        0        0        0
 7 Midwest          1007        0        0        0        4        0        0
 8 South            1008        0        0        0        0        0        0
 9 West             1009        0        0        0        0        1        0
10 Northeast        1010        0        0        0        0        0        0
# ℹ 10,521 more rows
# ℹ 6 more variables: jul_2018 <dbl>, aug_2018 <dbl>, sep_2018 <dbl>,
#   oct_2018 <dbl>, nov_2018 <dbl>, dec_2018 <dbl>

When column names are really values, the data frame ends up being wider than it should be. Use pivot_longer() to pivot the data frame longer by turning column names into values.

crm_long <- crm_data |>
  select(region, customer_id, contains("2018")) |> 
  pivot_longer(
    -c(region, customer_id),
    names_to = "month_year",
    values_to = "transactions"
  )

Note how much longer the data frame is and why.

crm_long
# A tibble: 126,372 × 4
   region customer_id month_year transactions
   <chr>        <dbl> <chr>             <dbl>
 1 South         1001 jan_2018              1
 2 South         1001 feb_2018              0
 3 South         1001 mar_2018              0
 4 South         1001 apr_2018              0
 5 South         1001 may_2018              0
 6 South         1001 jun_2018              4
 7 South         1001 jul_2018              0
 8 South         1001 aug_2018              0
 9 South         1001 sep_2018              0
10 South         1001 oct_2018              0
# ℹ 126,362 more rows

Now summarizing the transactions for 2018 by region is trivial.

crm_long |> 
  group_by(region) |> 
  summarize(
    total_transactions = sum(transactions),
    avg_transactions = mean(transactions)
  )
# A tibble: 4 × 3
  region    total_transactions avg_transactions
  <chr>                  <dbl>            <dbl>
1 Midwest                10092            0.764
2 Northeast              29958            0.774
3 South                  10501            0.788
4 West                   47375            0.775

Pivot Wider

If the data has the opposite problem and has values that should really be column names, use pivot_wider() to pivot the data frame wider by turning values into column names.

crm_long |> 
  pivot_wider(
    names_from = month_year,
    values_from = transactions
  )
# A tibble: 10,531 × 14
   region    customer_id jan_2018 feb_2018 mar_2018 apr_2018 may_2018 jun_2018
   <chr>           <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 South            1001        1        0        0        0        0        4
 2 West             1002        0        0        0        0        0        0
 3 South            1003        4        0        0        0        0        0
 4 Midwest          1004        0        0        0        2        0        0
 5 West             1005        0        1        0        0        0        0
 6 Midwest          1006        0        0        0        0        0        0
 7 Midwest          1007        0        0        0        4        0        0
 8 South            1008        0        0        0        0        0        0
 9 West             1009        0        0        0        0        1        0
10 Northeast        1010        0        0        0        0        0        0
# ℹ 10,521 more rows
# ℹ 6 more variables: jul_2018 <dbl>, aug_2018 <dbl>, sep_2018 <dbl>,
#   oct_2018 <dbl>, nov_2018 <dbl>, dec_2018 <dbl>

Separate Columns

If two (or more) values are in one column, separate() the values into two (or more) columns.

crm_long <- crm_long |>
  separate(month_year, c("month", "year"), sep = "_")

crm_long
# A tibble: 126,372 × 5
   region customer_id month year  transactions
   <chr>        <dbl> <chr> <chr>        <dbl>
 1 South         1001 jan   2018             1
 2 South         1001 feb   2018             0
 3 South         1001 mar   2018             0
 4 South         1001 apr   2018             0
 5 South         1001 may   2018             0
 6 South         1001 jun   2018             4
 7 South         1001 jul   2018             0
 8 South         1001 aug   2018             0
 9 South         1001 sep   2018             0
10 South         1001 oct   2018             0
# ℹ 126,362 more rows
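
In {tidyr} 1.3.0 and later, separate() is documented as superseded by the separate_wider_*() functions. A sketch of the same split with separate_wider_delim(), assuming that version or newer is installed and using a made-up toy_long tibble:

```r
library(tidyr)

# Hypothetical toy data with the same shape as crm_long.
toy_long <- tibble::tibble(
  customer_id = c(1001, 1001, 1002),
  month_year = c("jan_2018", "feb_2018", "jan_2018"),
  transactions = c(1, 0, 4)
)

# Split month_year at the underscore into two new columns.
toy_separated <- toy_long |>
  separate_wider_delim(
    month_year,
    delim = "_",
    names = c("month", "year")
  )

toy_separated
```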

Now we can summarize the transactions for 2018 by month and region.

crm_long |> 
  group_by(month, region) |> 
  summarize(
    total_transactions = sum(transactions),
    avg_transactions = mean(transactions)
  ) |> 
  arrange(desc(avg_transactions))
`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.
# A tibble: 48 × 4
# Groups:   month [12]
   month region    total_transactions avg_transactions
   <chr> <chr>                  <dbl>            <dbl>
 1 nov   South                   1812            1.63 
 2 dec   Midwest                 1723            1.56 
 3 dec   West                    7922            1.55 
 4 dec   Northeast               4928            1.53 
 5 dec   South                   1693            1.52 
 6 nov   West                    7656            1.50 
 7 nov   Northeast               4824            1.50 
 8 nov   Midwest                 1646            1.50 
 9 apr   Midwest                  751            0.682
10 may   South                    749            0.674
# ℹ 38 more rows

Unite Columns

When two (or more) values should be in one column, unite() the values into one column.

crm_long |>
  unite("month_year", c(month, year), sep = "_")
# A tibble: 126,372 × 4
   region customer_id month_year transactions
   <chr>        <dbl> <chr>             <dbl>
 1 South         1001 jan_2018              1
 2 South         1001 feb_2018              0
 3 South         1001 mar_2018              0
 4 South         1001 apr_2018              0
 5 South         1001 may_2018              0
 6 South         1001 jun_2018              4
 7 South         1001 jul_2018              0
 8 South         1001 aug_2018              0
 9 South         1001 sep_2018              0
10 South         1001 oct_2018              0
# ℹ 126,362 more rows

Data Classes and Types

We’ve been using data frames (technically tibbles, the tidyverse data frame). A data frame is composed of columns called vectors. Both data frames and vectors are classes of data.

Each vector has a single data type. Data types include double and integer (both numeric), date, character, and factor. Discrete data is often a factor, which stores integer codes along with character labels (its levels).

If we try to mix data types in a vector, R will coerce every value to the most flexible type present (logical, then integer, then double, then character).
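
A quick sketch of what happens when types are mixed inside c(), using typeof() to inspect the result:

```r
# Mixing types inside c() silently coerces to the most flexible type present.
typeof(c(TRUE, 2L))      # logical + integer
typeof(c(2L, 3.5))       # integer + double
typeof(c(3.5, "four"))   # double + character

# Everything becomes character once a single string is included.
mixed <- c(1, "a", TRUE)
mixed
```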

Coercion

Sometimes we need to coerce a data class, for example with as_tibble(), which converts an object to a tibble. We can similarly coerce data types with as.*() functions (e.g., as.numeric() and as.character()).

However, coercing dates is tricky (see the supplementary material for details), and factors are usually best handled with the fct_*() functions from {forcats}.
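
A minimal base R sketch of these coercions, including one common factor pitfall. The values are made up for illustration:

```r
# Character to numeric, and numeric to character.
as.numeric(c("1.5", "2"))
as.character(c(10, 20))

# A factor stores integer codes, so as.numeric() returns the codes,
# not the numbers shown in the labels.
ratings <- factor(c("5", "3", "5"))
codes <- as.numeric(ratings)
values <- as.numeric(as.character(ratings))  # go through character first

# Dates need a format string that matches the text being parsed.
review_date <- as.Date("2018-11-23", format = "%Y-%m-%d")
class(review_date)
```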

Base R

More Data Classes

A list is like a vector where each entry in the vector can be its own data class. It is the most general data class in R.

  • A vector is a list of values all of the same data type.
  • A data frame is a list of vectors with the same length.
  • We could have a list of vectors, data frames, and lists.

We also have matrices, the two-dimensional extension of vectors, and arrays, matrices with more than two dimensions.
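
These classes can be sketched in a few lines of base R. All object names and values below are hypothetical:

```r
# A list can hold objects of different classes, including other lists.
toy_list <- list(
  ids = c(1001, 1002, 1003),
  regions = c("South", "West"),
  nested = list(flag = TRUE)
)

# A data frame is a list of equal-length vectors.
toy_df <- data.frame(id = c(1001, 1002), region = c("South", "West"))
is.list(toy_df)

# A matrix is a vector with two dimensions; an array can have more.
toy_matrix <- matrix(1:6, nrow = 2, ncol = 3)
toy_array <- array(1:24, dim = c(2, 3, 4))
dim(toy_matrix)
dim(toy_array)
```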

Indexing

Indexing (subsetting, selecting) is about picking part of an object.

crm_data$gender[1]
[1] "Female"
crm_data[10:15, 3:4]
# A tibble: 6 × 2
  gender income
  <chr>   <dbl>
1 Female  77000
2 Male   122000
3 Female 126000
4 Female 197000
5 Male    49000
6 Male   347000
crm_data[1:10, c("birth_year", "gender")]
# A tibble: 10 × 2
   birth_year gender
        <dbl> <chr> 
 1       1971 Female
 2       1970 Female
 3       1988 Male  
 4       1984 Other 
 5       1987 Male  
 6       1994 Male  
 7       1968 Male  
 8       1994 Male  
 9       1958 Male  
10       1994 Female
crm_data[, -4]
# A tibble: 10,531 × 180
   customer_id birth_year gender credit married college_degree region    state
         <dbl>      <dbl> <chr>   <dbl> <chr>   <chr>          <chr>     <chr>
 1        1001       1971 Female   742. No      No             South     DC   
 2        1002       1970 Female   749. Yes     No             West      WA   
 3        1003       1988 Male     542. No      No             South     AR   
 4        1004       1984 Other    574. Yes     Yes            Midwest   MN   
 5        1005       1987 Male     644. No      Yes            West      HI   
 6        1006       1994 Male     554. Yes     Yes            Midwest   MN   
 7        1007       1968 Male     608. No      No             Midwest   MN   
 8        1008       1994 Male     710. No      No             South     KY   
 9        1009       1958 Male     702. No      No             West      NM   
10        1010       1994 Female   605. Yes     No             Northeast VT   
# ℹ 10,521 more rows
# ℹ 172 more variables: star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>, jan_2005 <dbl>, feb_2005 <dbl>,
#   mar_2005 <dbl>, apr_2005 <dbl>, may_2005 <dbl>, jun_2005 <dbl>,
#   jul_2005 <dbl>, aug_2005 <dbl>, sep_2005 <dbl>, oct_2005 <dbl>,
#   nov_2005 <dbl>, dec_2005 <dbl>, jan_2006 <dbl>, feb_2006 <dbl>,
#   mar_2006 <dbl>, apr_2006 <dbl>, may_2006 <dbl>, jun_2006 <dbl>, …
crm_list[[1]]
# A tibble: 10,531 × 13
   customer_id birth_year gender income credit married college_degree region   
         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
 1        1001       1971 Female  73000   742. No      No             South    
 2        1002       1970 Female  31000   749. Yes     No             West     
 3        1003       1988 Male    35000   542. No      No             South    
 4        1004       1984 Other   64000   574. Yes     Yes            Midwest  
 5        1005       1987 Male    58000   644. No      Yes            West     
 6        1006       1994 Male   164000   554. Yes     Yes            Midwest  
 7        1007       1968 Male    39000   608. No      No             Midwest  
 8        1008       1994 Male    69000   710. No      No             South    
 9        1009       1958 Male   233000   702. No      No             West     
10        1010       1994 Female  77000   605. Yes     No             Northeast
# ℹ 10,521 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>
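
The same indexing operators on a small made-up data frame and list, as a sketch. One caveat: a base data frame returns a plain vector when a single column is selected with [ , ], whereas a tibble returns a one-column tibble.

```r
# Hypothetical toy data.
toy_df <- data.frame(
  id = c(1001, 1002, 1003),
  gender = c("Female", "Male", "Other"),
  income = c(73000, 31000, 35000)
)

toy_df$gender[2]                  # $ pulls out a column, [ ] picks an element
toy_df[2:3, "income"]             # rows 2 to 3 of one column
toy_df[, -1]                      # a negative index drops the first column
toy_df[toy_df$income > 32000, ]   # a logical index keeps rows where TRUE

toy_list <- list(customers = toy_df, note = "toy example")
class(toy_list[1])     # single brackets return a smaller list
class(toy_list[[1]])   # double brackets return the element itself
```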

Summary, Length, and Structure

The summary() function is a generic function that summarizes an object in a way that is (usually) appropriate for the type of data.

summary(customer_data$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18000   83000  139000  138623  187000  470000 
summary(crm_list)
     Length Class       Mode
[1,]  13    spec_tbl_df list
[2,] 169    spec_tbl_df list

Keep tabs on the dimensions of an object using length() and (more generally) str().
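
A short sketch of these functions on a made-up data frame, with dim() added since it reports rows and columns directly:

```r
# Hypothetical toy data.
toy_df <- data.frame(
  id = c(1001, 1002, 1003),
  income = c(73000, 31000, 35000)
)

length(toy_df$income)  # number of elements in a vector
length(toy_df)         # for a data frame, the number of columns
dim(toy_df)            # rows, then columns
str(toy_df)            # compact overview of structure and types
```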

Missing Values

R represents missing values with NA. Most summary functions return NA when any input is missing unless we remove the missing values first.

mean(customer_data$star_rating, na.rm = TRUE)
[1] 4.406141

Or the tidyverse way:

customer_data |> 
  drop_na(star_rating) |> 
  summarize(star_mean = mean(star_rating))
# A tibble: 1 × 1
  star_mean
      <dbl>
1      4.41
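
A small sketch of how NA propagates, using a made-up vector of ratings:

```r
# Hypothetical ratings with two missing values.
ratings <- c(5, NA, 4, NA, 3)

mean(ratings)                 # NA: one missing input makes the result missing
mean(ratings, na.rm = TRUE)   # missing values are removed first
is.na(ratings)                # TRUE where a value is missing
sum(is.na(ratings))           # how many values are missing

# Comparing with == also returns NA, so test with is.na() instead.
NA == NA
```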

Plot

The plot() function is another generic function that handles an object in a way that is (usually) appropriate for the type of data.

plot(customer_data$income, customer_data$credit)

There are other flavors of plot().

hist(customer_data$income)

Base R plotting can be deceptively straightforward. Using {ggplot2} can feel very verbose in comparison, but it’s also more explicit. The actual plot that’s created is also stored very differently. A {ggplot2} plot is an object built up from layers and only drawn when it is printed, while plot() draws straight to the graphics device and plot()-adjacent functions such as points(), lines(), and abline() draw on top of the previous plot.
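
To make the contrast concrete, here is a sketch that draws the same scatterplot with a fitted line both ways. The toy data is simulated and the variable names are assumptions, not the course data.

```r
library(ggplot2)

# Simulated data standing in for income and credit (hypothetical values).
set.seed(1)
toy <- data.frame(income = runif(50, min = 20000, max = 200000))
toy$credit <- 600 + toy$income / 2000 + rnorm(50, sd = 20)

# Base R: plot() draws immediately; later calls add to the open plot.
plot(toy$income, toy$credit)
abline(lm(credit ~ income, data = toy), col = "red")

# ggplot2: the plot is an object that can be extended before it is drawn.
p <- ggplot(toy, aes(x = income, y = credit)) +
  geom_point()
p <- p + geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# Printing the object is what renders it.
p
```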

For Loops

If you have to copy and paste something more than twice, you should consider coding the iteration. This is usually in the form of a for loop (though you don’t always have to write one yourself; see apply() and map()).

empty_vector <- vector(mode = "double", length = 7)
for (i in seq_along(empty_vector)) {
  empty_vector[i] <- 1 + i
}

empty_vector
[1] 2 3 4 5 6 7 8
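
For comparison, a sketch of the same result without an explicit loop. The base R versions run as written; the {purrr} line is left as a comment because it assumes that package is installed.

```r
# Arithmetic in R is already vectorized.
vectorized <- 1 + 1:7

# The apply family hides the loop; vapply() also declares the return type.
with_sapply <- sapply(1:7, function(i) 1 + i)
with_vapply <- vapply(1:7, function(i) 1 + i, numeric(1))

# With {purrr} installed, the equivalent would be:
# purrr::map_dbl(1:7, function(i) 1 + i)

vectorized
```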

Conditional Statements

Sometimes you want code to run only when certain conditions are met. To do this, use conditional statements. Note that these are a separate idea from for loops, but a for loop can be a great place to use them!

empty_vector <- vector(mode = "double", length = 7)
for (i in seq_along(empty_vector)) {
  if (i == 1) {
    empty_vector[i] <- 1
  } else {
    empty_vector[i] <- 1 + empty_vector[i - 1]
  }
}

empty_vector
[1] 1 2 3 4 5 6 7

Functions

While a for loop is a powerful way to code an iteration and conditional statements give us even more flexibility, we might still need to copy and paste more than twice. What we often want is a function.

times_y <- function(x, y = 2) {
  x * y
}

times_y(2)
[1] 4
times_y(x = 2, y = 4)
[1] 8

A few things to note:

  • The arguments can include defaults.
  • Variables defined inside a function aren’t accessible outside it. A function can read variables from the environment where it was defined, but it is clearer to pass values in as arguments.
  • The value of the last expression is automatically returned, or we can use return() to get back a specific output.
  • Since a function has its own environment, debugging is trickier. Call debugonce() on the function and then call the function.
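
These points can be sketched with one small function. scale_income() is a hypothetical helper written for illustration only:

```r
# Hypothetical helper illustrating defaults, scoping, and return().
scale_income <- function(income, divisor = 1000) {
  if (divisor == 0) {
    return(NA_real_)            # explicit early return
  }
  scaled <- income / divisor    # `scaled` exists only inside the function
  scaled                        # the last expression is the return value
}

scale_income(73000)
scale_income(73000, divisor = 0)

# To step through the next call interactively:
# debugonce(scale_income)
# scale_income(73000)
```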

Assignment 1

Prepare a five-to-ten-minute presentation on the paper you’ve selected and read. Note that if you want to include links, the notation is [link name](URL), and an image is ![alt text](file path). You can also use $s around math or $$s around blocks of math.

Finally, if you modify the header you can specify where any figures you create will be saved (this will be especially important in future assignments).

---
title: "Paper Title"
format: revealjs
knitr:
  opts_chunk:
    fig.path: "../Figures/"
---

The name of each figure will match the label of the code block in which it was created (e.g., #| label: fig-ppd).